1. Introduction

In October 7 2023, the ongoing Israel-Hamas conflict initiated. At the same time, since February 24 2022, Russian and Ukrainian forces are fighting in a continuous struggle. Both conflicts have had a huge impact on the world economy while the cost in human life is detrimental.

Source: Financial Times-Ukraine’s counteroffensive against Russia in maps: latest updates.” Accessed: Jul. 07, 2024.
Source: Financial Times-Ukraine’s counteroffensive against Russia in maps: latest updates.” Accessed: Jul. 07, 2024.

U.S.A has been the main source of financial and military aid to Ukraine [2] so far into the war. Nevertheless, the current Israel-Hamas War, has pressured the U.S. budget, and there are ongoing talks within the U.S. senate in regard to the volume and direction of the provided aid.

A growing number of republican senators seem to be opposed of the volume of the support given [3], while democrats appear to be strongly in favor of not disrupting financial support in the war. There are voices that support a more internationally isolated U.S.A that does not interfere with world conflicts and problems and focuses more on internal matters. These voices, mainly from the republican spectrum, appear to oppose U.S.A intervention in both conflicts.

Source: Pew Research Center
Source: Pew Research Center

As can be indicated in the above figure from [4], the american public seems to be separated in terms of supporting the Ukrainian effort in the war. The percentage of republicans who support the war appears to be less than the percentage of democrats who support the war.

-How the research notebook is organized

Disclaimer: ChatGPT 3.5/4.0 was used in this project, mainly for code debugging and text clean-up.

2. Research Question

Main question

Does the political affiliation (liberal or conservative) of a newspaper, play a role on how topics fluctuate through time, and which topics are the most dominant? Does the difference in political affiliation have an impact on the sentiment of published articles? Will a certain newspaper be more or less in favor of providing aid in both or either wars.

Subquestions

To address the above questions articles were collected, on a daily and weekly level, from two main U.S. newspapers, the Wall Street Journal and the New York Times, starting from November 2023 (start of Israel-Hamas War) to June 2024 (current time). According to Boston’s university libraries, [6] the Wall Street Journal leans towards a more conservative political view, while the New York Times seems to follow a more liberal political view. A total of 2621 articles were collected, 1411 from New York Times and 1210 from the Wall Street Journal.

3. Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(httr)
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten
library(dplyr)
library(devtools)
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 4.3.3
library(quanteda) 
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
## Package version: 4.0.2
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
library(quanteda.textplots)
library(quanteda.textstats)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
library(udpipe) 
library(spacyr)
library(tm)
## Warning: package 'tm' was built under R version 4.3.3
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following objects are masked from 'package:quanteda':
## 
##     meta, meta<-
## 
## The following object is masked from 'package:httr':
## 
##     content
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## 
## Attaching package: 'tm'
## 
## The following object is masked from 'package:quanteda':
## 
##     stopwords
library(lubridate)
library(spacyr)
library(topicmodels)
library("ldatuning")
library(slam)
library(tidytext)
library(LDAvis)
library(alluvial)
library(patchwork)
library(tinytex)
library(RColorBrewer)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:httr':
## 
##     progress
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(syuzhet)
## 
## Attaching package: 'syuzhet'
## 
## The following object is masked from 'package:spacyr':
## 
##     get_tokens

4. Data Retrieval

Two methods were used to retrieve data, New York Times API [5] and ProQuest. New York Times API is a free access API, with limitation that access is not given to the full content of the articles but only to the lead paragraph. ProQuest also has the limitation that the data are provided in text format. Therefore certain text processing was needed to transform the data into tabular format.

New York Times API

# # Set your API key
# api_key <- ""
# 
# # Set the base URL for the New York Times API
# base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
# 
# # Define the parameters for the API request
# params <- list(
#   q = "",
#   fq = 'headline:("Ukraine" "Israel") AND document_type:("article")',
#   begin_date = "20231101", # Specify your start date in YYYYMMDD format
#   end_date = "20240601",   # Specify your end date in YYYYMMDD format
#   `api-key` = api_key,
#   page=0,
#   sort="oldest"
# )
# 
# # Make the initial API request to get metadata
# response <- GET(base_url, query = params)
# content <- content(response, "text", encoding = "UTF-8")
# articles <- fromJSON(content)
# article_list_87 <- articles$response$docs
# 
# # Calculate the number of pages
# maxPages <- round(articles$response$meta$hits / 10) - 1
# # maxPages <- 100
# # Initialize a list to store all pages of results
# # maxPages <- 100
# 
# pages <- list()
# 
# # Set your API key
# api_key <- ""
# 
# for(i in 0:maxPages){
#   # Set the base URL for the New York Times API
#   
#   base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
#   
#   params <- list(
#     q = "",
#     fq = 'headline:("Ukraine" "Israel") AND document_type:("article")',
#     begin_date = "20231101", # Specify your start date in YYYYMMDD format
#     end_date = "20240601",   # Specify your end date in YYYYMMDD format
#     `api-key` = api_key,
#     page=i,
#     sort = "oldest"
#   )
#   response <- GET(base_url, query = params)
#   content <- content(response, "text", encoding = "UTF-8")
#   articles <- fromJSON(content)
#   articles_list <- articles$response$docs
#   
#   message("Retrieving page ", i)
#   pages[[i+1]] <- articles_list
#   Sys.sleep(15) 
# }
# 
# allNYTSearch <- rbind_pages(pages)
# 
# 
# liberal_after <- allNYTSearch[, c("abstract", 
#                                   "web_url", 
#                                   "lead_paragraph",
#                                   "source",
#                                   "pub_date",
#                                   "_id",
#                                   "document_type")]
# 
# na_count_per_column <- colSums(is.na(allNYTSearch))
# 
# headlines <- allNYTSearch$headline[["main"]]
# 
# liberal_after$headlines <- headlines

Duplicates: Identification and Removal for the New York Times Articles

Duplicate rows are removed based on identical lead paragraph and identical url of the articles.

# duplicates <- duplicated(liberal_after[, "lead_paragraph"])
# print(sum(duplicates))
# 
# duplicates <- duplicated(liberal_after[, "web_url"])
# print(sum(duplicates))
# 
# duplicate_rows <- liberal_after[duplicates, ]
# print(duplicate_rows)
# 
# liberal_after_clean <- liberal_after[!duplicated(liberal_after[, "lead_paragraph"]), ]

Data Retrieval from ProQuest

A function for transforming txt file into tabular form is created and applied to the imported txt file of the articles.

txt_to_dataframe <- function(filepath) {
  
  file_path <- filepath
  
  text_file <- readLines(file_path)
  sections <- list()
  
  # Initialize variables to track section boundaries
  start_index <- 1
  
  # Loop through the vector to identify and split sections
  for (i in seq_along(text_file)) {
    if (text_file[i] == "") {
      # Found an empty string, so split the section
      sections[[length(sections) + 1]] <- text_file[start_index:(i - 1)]
      start_index <- i + 1
    }
  }
  
  # Add the last section if the vector doesn't end with an empty string
  if (start_index <= length(text_file)) {
    sections[[length(sections) + 1]] <- text_file[start_index:length(text_file)]
  }
  
  for (i in seq_along(sections)) {
    # Check if the element is a character vector
    if (is.character(sections[[i]])) {
      # Combine the elements into a single character string
      sections[[i]] <- paste(sections[[i]], collapse = " ")
    }
  }
  
  if (file_path == "C:\\KU Leuven\\Collecting Big Data for Social Sciences\\November23_Now_Wall_Street_Israel.txt") {
    pattern_dates <- "^Publication date:"
    
    # Use grep to find lines matching the pattern
    dates <- grep(pattern_dates, sections, value = TRUE)
    dates <- str_replace(dates, "Publication date: ", "")
    dates <- c(dates, NA)
  } else {
    
    pattern_dates <- "^Publication date:"
    
    # Use grep to find lines matching the pattern
    dates <- grep(pattern_dates, sections, value = TRUE)
    dates <- str_replace(dates, "Publication date: ", "")
  }
  
  pattern_titles <- "^Title:"
  
  # Use grep to find lines matching the pattern
  titles <- grep(pattern_titles, sections, value = TRUE)
  titles <- str_replace(titles, "Title: ", "")
  
  # pattern_dates <- "^Publication date:"
  # # Use grep to find lines matching the pattern
  # dates <- grep(pattern_dates, sections, value = TRUE)
  # dates <- str_replace(dates, "Publication date: ", "")
  # dates <- c(dates, NA)
  
  pattern_text <- "^Full text:"
  
  # Use grep to find lines matching the pattern
  text <- grep(pattern_text, sections, value = TRUE)
  text <- str_replace(text, "Full text: ", "")
  
  new_dataframe <- data.frame(Date = dates, Title = titles, Text = text)
  
  return(new_dataframe)
}

Create dataframes for both Publishers

# after_nov23_ukraine_wall_street <- txt_to_dataframe("C:\\KU Leuven\\Collecting Big Data for Social Sciences\\November23_Now_Wall_Street_News.txt")
# 
# # Create Publisher Column
# after_nov23_ukraine_wall_street$Publisher <- "Wall_Street_News"
# 
# after_nov23_israel_wall_street <- txt_to_dataframe("C:\\KU Leuven\\Collecting Big Data for Social Sciences\\November23_Now_Wall_Street_Israel.txt")
# 
# # Create Publisher Column
# after_nov23_israel_wall_street$Publisher <- "Wall_Street_News"
####### Merge into a single dataframe 
# 
# conservative_after <- rbind(after_nov23_ukraine_wall_street, 
#                             after_nov23_israel_wall_street
# )
#
# conservative_after$Date <- mdy(conservative_after$Date)
# conservative_after$Month_Year <- format(conservative_after$Date, "%b %Y")
# conservative_after <- na.omit(conservative_after)
# 
# head(conservative_after)

Duplicates: Identification and Removal of Duplicates for the Wall Street Journal Articles

########### Duplicates

# duplicates <- duplicated(conservative_after[, c("Title","Text")])
# print(sum(duplicates))
# 
# duplicate_rows <- conservative_after[duplicates, ]
# print(duplicate_rows)
# 
# conservative_after <- conservative_after[!duplicated(conservative_after[, c("Title","Text")]), ]
# 
# duplicates <- duplicated(conservative_after[, c("Title")])
# print(sum(duplicates))
# 
# conservative_after <- conservative_after[!duplicated(conservative_after[, "Title"]), ]
# 
# duplicates <- duplicated(conservative_after[, c("Text")])
# print(sum(duplicates))
# 
# conservative_after <- conservative_after[!duplicated(conservative_after[, "Text"]), ]

5. Import Datasets from Local PC

Following the retrieval and cleaning of The New York Times (Liberal) and Wall Street Journal (Conservative) datasets, the datasets were stored in local storage. In the code below the “liberal_after_clean” and “conservative_after” files are imported for further use.

################### Import Datasets 

liberal_after <- read.csv("/Users/alessandrosalvatori/Desktop/KU LEUVEN/EXAMS/SECOND YEAR/RETAKES/COLLECTING AND ANALYZING BIG DATA FOR SOCIAL SCIENCES/PROJECT/liberal_after_clean.csv")

conservative_after <- read.csv("/Users/alessandrosalvatori/Desktop/KU LEUVEN/EXAMS/SECOND YEAR/RETAKES/COLLECTING AND ANALYZING BIG DATA FOR SOCIAL SCIENCES/PROJECT/conservative_after.csv")

liberal_after <- liberal_after[, c("headlines", 
                                   "abstract",
                                   "lead_paragraph",
                                   "source",
                                   "pub_date")]
head(liberal_after)
head(conservative_after)

Date columns are converted into date type for better handling. In addition, regarding articles from the New York Times, the lead_paragraph is combined with the abstract. This is done in order to provide more information as input to the chosen topic model used for topic allocation of each article.

# Convert to date type 

liberal_after$pub_date <- ymd_hms(liberal_after$pub_date)
liberal_after$Month_Year <- format(liberal_after$pub_date, "%b %Y")

liberal_after <- liberal_after %>% select(-pub_date)
conservative_after <- conservative_after %>% select(-Date)

# Combine abstract with lead_paragraph to help topic modelling 

liberal_after <- liberal_after %>% 
  mutate(
    Text = paste(abstract, lead_paragraph, sep = " ")
  )

liberal_after <- liberal_after %>% select(-abstract,-lead_paragraph)

liberal_after  <- liberal_after  %>%
  rename(
    Title = headlines,
    Publisher = source,
    text = Text
  )

conservative_after  <- conservative_after  %>%
  rename(
    text = Text
  )

Combine both Wall Street Journal and New York Times Articles into a single Dataframe

# Combine both dataframes into a single dataframe for all newspapers

liberal_after <- liberal_after %>% select(Title, text, Month_Year, Publisher)

conservative_after <- conservative_after %>% select(Title, text, Month_Year, Publisher)

newspapers <- rbind(conservative_after, liberal_after)

articles_per_newspaper <- newspapers %>%
  group_by(Publisher) %>%
  summarise(count = n())

# articles_per_newspaper
newspapers <- newspapers %>%
  mutate(Publisher = ifelse(Publisher == "International New York Times", "The New York Times", Publisher))

articles_per_newspaper <- newspapers %>%
  group_by(Publisher) %>%
  summarise(count = n())
# articles_per_newspaper

head(articles_per_newspaper)
head(newspapers)

6. Text Preprocessing

The above techniques are used in order to include only the most important information from the text and filter out redundant information such as punctuation and similar words. The steps followed to apply the preprocessing steps are inspired and understood as in [7].

Use of General Expressions to Remove email and redudant information from each article”

print(newspapers[3,2])
## [1] "KYIV, Ukraine<U+2014>In 2020, Vitaliy Yatsenko went to pick up a parcel containing illegal amphetamines from a Kyiv post office and was met by 10 policemen and detained. This week he will cut short his five-year prison sentence to join Ukraine's stretched armed forces. In a sign of the Ukrainian military's desperate need for fresh troops, Kyiv is taking a leaf out of Russia's playbook by recruiting inmates from prisons to serve in its armed forces. The government says that 4,656 convicts have already applied for the program in which prisoners will have to serve till the end of the war before winning their freedom. Kyiv is faced with stark choices as an initial wave of volunteers fades and they lose ground against an enemy that can draw on a population 3<U+00BD> times as large. Many front-line units say they are depleted and exhausted, and Ukraine is struggling to draft enough men to hold off mounting Russian offensives. In search of hundreds of thousands of new soldiers, Ukraine has lowered the age of mobilization, increased financial compensation for troops and sought to coerce military-age men who fled abroad to return home and fight. This week, Yatsenko will leave his prison cell and join the military. For men like this 23-year-old, the program is a chance for redemption. \"I feel ashamed to be in prison,\" he said in an interview at the jail where he is being held. \"This is my chance to be useful.\" Yatsenko doesn't know where he will be sent or what role he will be given. He has yet to tell his mother, but said he is driven in part by a desire to make her proud following his incarceration. Convicts have been used in wartime through much of history, often in the most dangerous roles. Napoleon deployed penal brigades and both Nazi Germany and the Soviet Union drafted criminals and political prisoners. After World War II the practice ended in many countries, not least because there was no need for large-scale mobilization. The Ukraine war has led to a resurgence. Russia's Wagner militia began to recruit convicts soon after its February 2022 invasion started to go awry. Moscow continued the practice after Wagner's leader, Yevgeny Prigozhin, rebelled against the military leadership and died in a plane crash in August last year. Ukraine's program will differ in several respects. Unlike in Russia, those convicted of certain crimes won't be eligible. That includes those with convictions for sexual violence, traffic accidents that led to deaths, and murder if it was of more than one person or carried out with \"particular cruelty,\" among other restrictions, said <U+041E>lena Vysotska, a deputy Ukrainian justice minister. While Russian prisoners will mainly get their criminal record expunged after service, Ukrainians won't. Ukraine's Ministry of Justice estimates that authorities can recruit around 5,000 people from prisons. Russia never confirmed the total number of convicts it recruited but figures from the prison service show a reduction of more than 35,000 in the country's total prison population between May 2022 and January 2023, the peak of Wagner's recruitment. A senior official at Yatsenko's prison said several convicts with more serious criminal records have been told their convictions bar them from serving, leaving them disappointed. Likewise, some have expressed interest, only to back down when informed of the risks, he said. Convicts will be placed in special units, but it isn't clear what they will be tasked to do. Russia's Wagner units were used in late 2022 and early 2023 in risky attack waves on the city of Bakhmut that resulted in thousands of deaths. Ukraine's Ministry of Defense didn't immediately comment, though the country tends to take fewer risks with its soldiers than Russia does. Volodymyr Barandich, another recruit, said he is impatient to leave jail for a front-line position. Around six months ago Barandich was an army corporal serving around the town of Avdiivka, one of the front line's most dangerous hot spots , when he was sentenced for a drug-dealing offense. Barandich maintains his innocence and said he was set up by a former friend. \"I felt ashamed, because I was in here and my colleagues were still at the front,\" he said. He has almost five years of his sentence to run. The 32-year-old had been in the military for six years when he was jailed. During his time in prison he said he never lost the ambition to return to the front line. Then in May, he was in a prison workshop when another convict told him that a law had passed that would allow those in jail to serve. \"Finally,\" he said he remembers thinking. Neither Barandich nor Yatsenko say they are nervous about fighting. Barandich's girlfriend Alina said that she is nervous. But she says she supports the decision of a man who has always felt at ease in the military. \"Why should he be in prison if he can fully serve his country?\" she said. Yatsenko grew up impoverished in Kyiv in a single-parent household. He says that he dealt drugs because he wanted the money. Embarrassed by the conviction, his girlfriend left him. On hearing of his arrest, his mother got angry and screamed that he was stupid. \"I was stupid,\" he said. While the program has been broadly welcomed in Ukraine, some have expressed concern on social media about how armed convicts will be controlled. The initial round of Russian convicts could leave the army after six months and after returning to civilian life some committed serious crimes, including murder . Ukraine officials say its program takes on convicts of less serious crimes than Russia's. Those who have committed a murder can apply but their application must go through a risk assessment with the prison, judicial and prosecution service, said Vysotska from the Ministry of Justice. Vysotska said there are patriots among convicts who want to rehabilitate themselves. A prison service should emphasize correcting behavior and resocializing people for outside life, not incarceration for the sake of it, she said. Yatsenko says other prisoners told him they will see how he and other convicts fare before deciding. On a recent visit to their prison, bored-looking men stood in courtyards smoking. Some labored under a hot sun making concrete obstacles known as dragon's teeth for the military. \"But prison life is like a summer holiday camp\" compared with the front, said Barandich. Oksana Pyrozhok and Ievgeniia Sivorka contributed to this article. Write to Alistair MacDonald at Alistair.Macdonald@wsj.com Credit: By Alistair MacDonald | Photographs by Serhii Korovayny for The Wall Street Journal"

As it can be observed from the above article, at the end of each article there is a sentence starting with “Credit:”. In addition, in certain articles, there is email information such as “” Both the email information and the part after and including “Credit:” of the the text will be removed, using general expression patterns as done below:

###### Text Preprocessing 

# Remove email patterns and Credit:.... from the end of paragraphs 

remove_emails_credit <- function(article) {
  email_pattern <- "\\b[\\w.%+-]+@[\\w.-]+\\.[a-zA-Z]{2,}\\b"
  credit_pattern <- "Credit:.*$"
  article <- str_remove_all(article, email_pattern)
  article <- str_remove_all(article, credit_pattern)
  article  <- trimws(article)
  return(article)
}

newspapers$text <- sapply(newspapers$text, remove_emails_credit)

print(newspapers[3,2])
## [1] "KYIV, Ukraine<U+2014>In 2020, Vitaliy Yatsenko went to pick up a parcel containing illegal amphetamines from a Kyiv post office and was met by 10 policemen and detained. This week he will cut short his five-year prison sentence to join Ukraine's stretched armed forces. In a sign of the Ukrainian military's desperate need for fresh troops, Kyiv is taking a leaf out of Russia's playbook by recruiting inmates from prisons to serve in its armed forces. The government says that 4,656 convicts have already applied for the program in which prisoners will have to serve till the end of the war before winning their freedom. Kyiv is faced with stark choices as an initial wave of volunteers fades and they lose ground against an enemy that can draw on a population 3<U+00BD> times as large. Many front-line units say they are depleted and exhausted, and Ukraine is struggling to draft enough men to hold off mounting Russian offensives. In search of hundreds of thousands of new soldiers, Ukraine has lowered the age of mobilization, increased financial compensation for troops and sought to coerce military-age men who fled abroad to return home and fight. This week, Yatsenko will leave his prison cell and join the military. For men like this 23-year-old, the program is a chance for redemption. \"I feel ashamed to be in prison,\" he said in an interview at the jail where he is being held. \"This is my chance to be useful.\" Yatsenko doesn't know where he will be sent or what role he will be given. He has yet to tell his mother, but said he is driven in part by a desire to make her proud following his incarceration. Convicts have been used in wartime through much of history, often in the most dangerous roles. Napoleon deployed penal brigades and both Nazi Germany and the Soviet Union drafted criminals and political prisoners. After World War II the practice ended in many countries, not least because there was no need for large-scale mobilization. The Ukraine war has led to a resurgence. Russia's Wagner militia began to recruit convicts soon after its February 2022 invasion started to go awry. Moscow continued the practice after Wagner's leader, Yevgeny Prigozhin, rebelled against the military leadership and died in a plane crash in August last year. Ukraine's program will differ in several respects. Unlike in Russia, those convicted of certain crimes won't be eligible. That includes those with convictions for sexual violence, traffic accidents that led to deaths, and murder if it was of more than one person or carried out with \"particular cruelty,\" among other restrictions, said <U+041E>lena Vysotska, a deputy Ukrainian justice minister. While Russian prisoners will mainly get their criminal record expunged after service, Ukrainians won't. Ukraine's Ministry of Justice estimates that authorities can recruit around 5,000 people from prisons. Russia never confirmed the total number of convicts it recruited but figures from the prison service show a reduction of more than 35,000 in the country's total prison population between May 2022 and January 2023, the peak of Wagner's recruitment. A senior official at Yatsenko's prison said several convicts with more serious criminal records have been told their convictions bar them from serving, leaving them disappointed. Likewise, some have expressed interest, only to back down when informed of the risks, he said. Convicts will be placed in special units, but it isn't clear what they will be tasked to do. Russia's Wagner units were used in late 2022 and early 2023 in risky attack waves on the city of Bakhmut that resulted in thousands of deaths. Ukraine's Ministry of Defense didn't immediately comment, though the country tends to take fewer risks with its soldiers than Russia does. Volodymyr Barandich, another recruit, said he is impatient to leave jail for a front-line position. Around six months ago Barandich was an army corporal serving around the town of Avdiivka, one of the front line's most dangerous hot spots , when he was sentenced for a drug-dealing offense. Barandich maintains his innocence and said he was set up by a former friend. \"I felt ashamed, because I was in here and my colleagues were still at the front,\" he said. He has almost five years of his sentence to run. The 32-year-old had been in the military for six years when he was jailed. During his time in prison he said he never lost the ambition to return to the front line. Then in May, he was in a prison workshop when another convict told him that a law had passed that would allow those in jail to serve. \"Finally,\" he said he remembers thinking. Neither Barandich nor Yatsenko say they are nervous about fighting. Barandich's girlfriend Alina said that she is nervous. But she says she supports the decision of a man who has always felt at ease in the military. \"Why should he be in prison if he can fully serve his country?\" she said. Yatsenko grew up impoverished in Kyiv in a single-parent household. He says that he dealt drugs because he wanted the money. Embarrassed by the conviction, his girlfriend left him. On hearing of his arrest, his mother got angry and screamed that he was stupid. \"I was stupid,\" he said. While the program has been broadly welcomed in Ukraine, some have expressed concern on social media about how armed convicts will be controlled. The initial round of Russian convicts could leave the army after six months and after returning to civilian life some committed serious crimes, including murder . Ukraine officials say its program takes on convicts of less serious crimes than Russia's. Those who have committed a murder can apply but their application must go through a risk assessment with the prison, judicial and prosecution service, said Vysotska from the Ministry of Justice. Vysotska said there are patriots among convicts who want to rehabilitate themselves. A prison service should emphasize correcting behavior and resocializing people for outside life, not incarceration for the sake of it, she said. Yatsenko says other prisoners told him they will see how he and other convicts fare before deciding. On a recent visit to their prison, bored-looking men stood in courtyards smoking. Some labored under a hot sun making concrete obstacles known as dragon's teeth for the military. \"But prison life is like a summer holiday camp\" compared with the front, said Barandich. Oksana Pyrozhok and Ievgeniia Sivorka contributed to this article. Write to Alistair MacDonald at"

Email patterns and articles needless information has now been removed.

Define the corpus of all the articles

# Investigate the corpus 

corpus_news = corpus(newspapers)
corpus_news
## Corpus consisting of 2,621 documents and 3 docvars.
## text1 :
## "KYIV, Ukraine -- In 2020, Vitaliy Yatsenko picked up a parce..."
## 
## text2 :
## "Iryna Tsybukh rescued the wounded from Ukraine's bloodiest b..."
## 
## text3 :
## "KYIV, Ukraine<U+2014>In 2020, Vitaliy Yatsenko went to pick ..."
## 
## text4 :
## "Iryna Tsybukh rescued the wounded from Ukraine's bloodiest b..."
## 
## text5 :
## "RIVNE, Ukraine -- After Russia's full-scale military invasio..."
## 
## text6 :
## "RIVNE, Ukraine<U+2014>After Russia's full-scale military inv..."
## 
## [ reached max_ndoc ... 2,615 more documents ]

Use of General Expressions to identify all types of punctuation present in the articles

# Create function to identify punctuation symbols throughout the corpus 

extract_punctuation <- function(text) {
  pattern <- "[^a-zA-Z ]"
  extracted <- str_extract_all(text, pattern)
  extracted <- unlist(extracted)
  extracted <- extracted[!is.na(extracted)]
  return(extracted)
}

# Identify punctuation symbols 

punctuation_in_corpus <- extract_punctuation(corpus_news)
print(unique(punctuation_in_corpus))
##  [1] ","  "-"  "2"  "0"  "1"  "."  "'"  "4"  "6"  "5"  "3"  "/"  "\"" "$"  "7" 
## [16] "8"  ":"  "9"  "<"  "+"  ">"  "?"  ";"  "%"  "("  ")"  "["  "]"  "&"  "!" 
## [31] "@"  "#"  "_"  "="  "*"  "“"  "”"  "’"  "—"  "ó"  "‘"  "è"  "é"  "á"  "ö" 
## [46] "ü"  "à"  "­"  "ı"  "Ü"  "â"

Remove Punctuation from all the Articles

# Create function to remove digits and punctuation characters 

remove_punctuation <- function(article) {
  
  pattern_1 <- "[[:digit:]]"
  pattern_2 <- "[[:punct:]]"
  symbols <- c("B=", "<", ">", "+", "b   ", ".", "=", "|","","$","b,","B#","B%")
  pattern_3 <- paste0("[", paste(symbols, collapse = ""), "]")
  new_article <- article %>% 
    str_replace_all(pattern_1, " ") %>% 
    str_replace_all(pattern_2, " ") %>% 
    str_replace_all(pattern_3, " ")
  
  return(new_article)
}

# Remove punctuation from the text column of the dataframe 

newspapers$text <- sapply(newspapers$text, remove_punctuation)

# Investigate the corpus again 

corpus_news = corpus(newspapers)
corpus_news
## Corpus consisting of 2,621 documents and 3 docvars.
## text1 :
## "KYIV  Ukraine    In       Vitaliy Yatsenko picked up a parce..."
## 
## text2 :
## "Iryna Tsy ukh rescued the wounded from Ukraine s  loodiest  ..."
## 
## text3 :
## "KYIV  Ukraine U      In       Vitaliy Yatsenko went to pick ..."
## 
## text4 :
## "Iryna Tsy ukh rescued the wounded from Ukraine s  loodiest  ..."
## 
## text5 :
## "RIVNE  Ukraine    After Russia s full scale military invasio..."
## 
## text6 :
## "RIVNE  Ukraine U      After Russia s full scale military inv..."
## 
## [ reached max_ndoc ... 2,615 more documents ]

Lower-case and Remove Stopwords from the Corpus

# Tokenize, lower-case and remove stopwords 

tokens_news = corpus_news %>% 
  tokens() %>% 
  tokens_tolower() %>% 
  tokens_remove(stopwords("english"))
tokens_news
## Tokens consisting of 2,621 documents and 3 docvars.
## text1 :
##  [1] "kyiv"         "ukraine"      "vitaliy"      "yatsenko"     "picked"      
##  [6] "parcel"       "containing"   "illegal"      "amphetamines" "kyiv"        
## [11] "post"         "office"      
## [ ... and 459 more ]
## 
## text2 :
##  [1] "iryna"    "tsy"      "ukh"      "rescued"  "wounded"  "ukraine" 
##  [7] "s"        "loodiest" "attles"   "working"  "com"      "medic"   
## [ ... and 806 more ]
## 
## text3 :
##  [1] "kyiv"         "ukraine"      "u"            "vitaliy"      "yatsenko"    
##  [6] "went"         "pick"         "parcel"       "containing"   "illegal"     
## [11] "amphetamines" "kyiv"        
## [ ... and 671 more ]
## 
## text4 :
##  [1] "iryna"    "tsy"      "ukh"      "rescued"  "wounded"  "ukraine" 
##  [7] "s"        "loodiest" "attles"   "working"  "com"      "medic"   
## [ ... and 811 more ]
## 
## text5 :
##  [1] "rivne"    "ukraine"  "russia"   "s"        "full"     "scale"   
##  [7] "military" "invasion" "ukraine"  "ruptly"   "stopped"  "uying"   
## [ ... and 614 more ]
## 
## text6 :
##  [1] "rivne"    "ukraine"  "u"        "russia"   "s"        "full"    
##  [7] "scale"    "military" "invasion" "ukraine"  "ruptly"   "stopped" 
## [ ... and 381 more ]
## 
## [ reached max_ndoc ... 2,615 more documents ]

Create a new column that contains the cleaned tokens of each article

# Create a list of tokens for each document in the corpus and assign 
# it to the dataframe as column 

vector_vectors <- list()
for (i in 1:nrow(newspapers)) {
  token_vector <- tokens_news[[i]]
  vector_vectors <- c(vector_vectors, list(token_vector))
}

newspapers$tokens<- vector_vectors

newspapers[1:10,c(2,5)]

Identify difficult to detect duplicate articles, that were not detected previously

For each article we choose first 4 tokens from the cleaned list of tokens of each article. Articles with the same 4 tokens are classified as duplicates and removed.

# Remove articles that appear different but they are actually duplicates 

# Identify duplicates by looking at articles with the same first 4 tokens 

newspapers <- newspapers %>% 
  mutate(
    four_elements = map(tokens, ~ .x[1:4])
  )

duplicates <- duplicated(newspapers[, "four_elements"])
print(sum(duplicates))
## [1] 345
newspapers <- newspapers[!duplicated(newspapers[, "four_elements"]), ]

newspapers <- newspapers %>% select(-tokens,-four_elements)

There are 345 duplicate articles, which are removed.

Lemmatization

Lemmatization is used to obtain the root word of each word, and thus keep only essential information. The below pipeline is used, to further lower-case the already cleaned corpus, remove stopwords and then lemmatize the tokens to only keep their root. In addition, tokens which consist of only one character are removed since they offer little to no information. Then a Document-Term Matrix from tokenized text data and a frequency matrix to analyze term frequencies across the corpus are generated.

# Lemmatization of tokens and only keep tokens that are not a single character 

corpus_news = corpus(newspapers)

tokens_news = corpus_news %>% 
  tokens() %>% 
  tokens_tolower() %>% 
  tokens_remove(stopwords("english")) %>% 
  tokens_replace(pattern = lexicon::hash_lemmas$token, replacement = lexicon::hash_lemmas$lemma) %>%
  tokens_select(min_nchar = 2) 
tokens_news
## Tokens consisting of 2,276 documents and 3 docvars.
## text1 :
##  [1] "kyiv"        "ukraine"     "vitaliy"     "yatsenko"    "pick"       
##  [6] "parcel"      "contain"     "illegal"     "amphetamine" "kyiv"       
## [11] "post"        "office"     
## [ ... and 434 more ]
## 
## text2 :
##  [1] "iryna"    "tsy"      "ukh"      "rescue"   "wound"    "ukraine" 
##  [7] "loodiest" "attles"   "work"     "com"      "medic"    "ms"      
## [ ... and 751 more ]
## 
## text3 :
##  [1] "kyiv"        "ukraine"     "vitaliy"     "yatsenko"    "go"         
##  [6] "pick"        "parcel"      "contain"     "illegal"     "amphetamine"
## [11] "kyiv"        "post"       
## [ ... and 631 more ]
## 
## text4 :
##  [1] "rivne"    "ukraine"  "russia"   "full"     "scale"    "military"
##  [7] "invasion" "ukraine"  "ruptly"   "stop"     "uying"    "nuclear" 
## [ ... and 574 more ]
## 
## text5 :
##  [1] "rivne"    "ukraine"  "russia"   "full"     "scale"    "military"
##  [7] "invasion" "ukraine"  "ruptly"   "stop"     "uying"    "nuclear" 
## [ ... and 355 more ]
## 
## text6 :
##  [1] "colleville"    "sur"           "mer"           "france"       
##  [5] "president"     "iden"          "use"           "have"         
##  [9] "day"           "commemoration" "along"         "windswept"    
## [ ... and 561 more ]
## 
## [ reached max_ndoc ... 2,270 more documents ]
# Document - Term Matrix (Bag of Words) 

dtm <- tokens_news %>% 
  dfm()
dtm
## Document-feature matrix of: 2,276 documents, 17,594 features (99.11% sparse) and 3 docvars.
##        features
## docs    kyiv ukraine vitaliy yatsenko pick parcel contain illegal amphetamine
##   text1    4       8       1        5    1      1       1       1           1
##   text2    1      13       0        0    0      0       0       0           0
##   text3    5      10       1        7    1      1       1       1           1
##   text4    3      20       0        0    0      0       0       0           0
##   text5    1      18       0        0    0      0       0       0           0
##   text6    1       3       0        0    0      0       0       0           0
##        features
## docs    post
##   text1    1
##   text2    0
##   text3    1
##   text4    0
##   text5    0
##   text6    0
## [ reached max_ndoc ... 2,270 more documents, reached max_nfeat ... 17,584 more features ]
# Frequency Matrix

textstat_frequency(dtm)

We observe that the DTM is quite sparce with 99.11 percent of entries being zero. Moreover, from the frequency matrix, we observe that the verb “say” has a really high frequency, which does not offer any importance as a word. We continue by trimming the DTM matrix, so that tokens that appear to more that 75% of the articles (documents) are removed while tokens that appear in fewer that 5% of the articles, are also removed.

Trimming of the DTM and Visualization of the most frequent words

# Trim the DTM to keep tokens that appear in fewer
# than 0.5 percent and below 75 percent of the documents 

dtm_tr = dfm_trim(dtm, min_docfreq = 0.005, 
                  max_docfreq = 0.75,
                  docfreq_type = "prop")


textplot_wordcloud(dtm, max_words=150,min_size=1, max_size = 4,random_order = F,
                   color = rev(RColorBrewer::brewer.pal(5, "Dark2")))

Extraction of Collocations and inclusion of them in the corpus

Collocations are words that always appear together in the text. Examples are “tel aviv” or “los angeles”. They can offer useful information for the topic model used, thus are included in the corpus.

# Extract collocations in the tokens and print the most important in terms of context

colloc = tokens_news %>% 
  textstat_collocations(min_count=30) %>% 
  as_tibble()

print(colloc %>% arrange(-lambda), n=50)
## # A tibble: 1,057 × 6
##    collocation     count count_nested length lambda     z
##    <chr>           <int>        <int>  <dbl>  <dbl> <dbl>
##  1 hez ollah         536            0      2   21.0 10.5 
##  2 carrie keller      32            0      2   18.2  9.05
##  3 keller lynn        32            0      2   16.6 10.6 
##  4 vivian salama      34            0      2   16.3 10.7 
##  5 anat peled         59            0      2   16.1 15.4 
##  6 ca inet           172            0      2   16.0 11.2 
##  7 jared malsin       49            0      2   16.0 10.9 
##  8 tel aviv          212            0      2   15.9 21.2 
##  9 ki utz             73            0      2   15.7 10.9 
## 10 stephen kalin      31            0      2   15.4 10.5 
## 11 istan ul           33            0      2   15.2 10.4 
## 12 los angeles        32            0      2   15.1 10.4 
## 13 khan younis       159            0      2   15.0 10.6 
## 14 fe ruary          195            0      2   15.0 10.6 
## 15 lindsay wise       35            0      2   14.4 17.9 
## 16 repu licans       643            0      2   14.2 10.0 
## 17 antony linken     136            0      2   13.8  9.74
## 18 ja arin            34            0      2   13.7  9.57
## 19 xi jinping         35            0      2   13.6  9.52
## 20 ismail haniyeh     31            0      2   13.6 15.6 
## 21 en gvir            43            0      2   13.6  9.52
## 22 repu lican        429            0      2   13.5  9.51
## 23 neigh orhood       57            0      2   13.1  9.23
## 24 stolten erg        30            0      2   13.1  9.16
## 25 colum ia           62            0      2   13.0  9.16
## 26 pu licly          179            0      2   12.9  9.10
## 27 lloyd austin       42            0      2   12.9 21.4 
## 28 emmanuel macron    30            0      2   12.8  8.98
## 29 kerem shalom       33            0      2   12.8 18.6 
## 30 em assy            62            0      2   12.8  8.99
## 31 mitch mcconnell    51            0      2   12.8  8.96
## 32 ultra orthodox     39            0      2   12.7 15.0 
## 33 yoav gallant       80            0      2   12.5 15.1 
## 34 om ardment         83            0      2   12.3  8.69
## 35 contri uted       368            0      2   12.3 45.4 
## 36 asylum seeker      48            0      2   12.1 14.5 
## 37 ja alia            33            0      2   12.1 17.9 
## 38 rear adm           36            0      2   12.1 25.0 
## 39 le anon           315            0      2   12.0  8.51
## 40 gordon lu          38            0      2   12.0 24.2 
## 41 jake sullivan      60            0      2   12.0 21.2 
## 42 com ined           38            0      2   11.8  8.31
## 43 yahya sinwar       77            0      2   11.8 18.2 
## 44 pro lem           138            0      2   11.8 14.3 
## 45 novem er          203            0      2   11.8  8.30
## 46 neigh ors          45            0      2   11.7 14.1 
## 47 octo er           195            0      2   11.7  8.27
## 48 li eral            52            0      2   11.7 17.8 
## 49 su stantial        31            0      2   11.6  8.16
## 50 uted article      345            0      2   11.6 52.4 
## # ℹ 1,007 more rows

The lambda index signifies how important a collocation is, thus the above matrix is filtered based on lambda. We chose to keep collocations with lambda above 4 and append those collocations to the cleaned tokens list as shown below.

# Add the collocations to the tokens list 

collocations = colloc  %>% 
  filter(lambda > 4)  %>%  
  pull(collocation)  %>%  
  phrase()

tokens_news_col <- tokens_news %>%  tokens_compound(collocations)

DTM Matrix Creation, Trimming and WordCloud Visualization of the tokens including the collocations

# Investigate DTM, Token Statistics and Workcloud 

dtm_col <- tokens_news_col %>% 
  dfm()

textstat_frequency(dtm_col)
dtm_col_tr = dfm_trim(dtm_col, min_docfreq = 0.005, 
                  max_docfreq = 0.75,
                  docfreq_type = "prop")

textplot_wordcloud(dtm_col_tr, max_words=150,min_size=1, max_size = 4,
                   color = rev(RColorBrewer::brewer.pal(4, "RdBu")))

We observe in the wordcloud that many tokens are verbs such as “see”, “meet”, “run”, “know” while the dominant token is a verb itself, “say”. In the following code, we apply spacy’s POS Tagger to identify the part of speech of each token. Then we proceed to keep only NOUNS and PROPER NOUNS (which refer to an entity) and then create a DTM and WordCloud of only those type of tags.

In the following code, spacy’ POS Tagger is installed and saved. The goal is to apply the tagger on the preprocessed corpus (without the collocations). A new column is created where each row contains a character vector of the already preprocessed tokens of each article. Then for each row, the tokes are joined together and assigned to a new column. A corpus of this new column is then created for the POS Tagger to be applied on.

head(newspapers[[1,2]])
## [1] "KYIV  Ukraine    In       Vitaliy Yatsenko picked up a parcel containing illegal amphetamines from a Kyiv post office and was met  y    policemen  This week he will cut short his five year prison sentence to join Ukraine s stretched armed forces  In a sign of the Ukrainian military s desperate need for fresh troops  Kyiv is taking a page from Russia s play ook  y recruiting inmates from prisons to serve in its military  The government says       convicts have applied for the program in which prisoners will have to serve until the end of the war  efore winning their freedom  Kyiv is faced with stark choices as an initial wave of volunteers fades and they lose ground against an enemy that can draw on a population       times as large  Many front line units say they are depleted and exhausted  and Ukraine is struggling to draft enough men to hold off Russian offensives  In search of hundreds of thousands of new soldiers  Ukraine has lowered the age of mo ilization  increased financial compensation for troops and sought to coerce military age men who fled a road to return and fight  This week  Yatsenko will leave his prison and join the military  For men like this    year old  the program is a chance for redemption   I feel ashamed to  e in prison   he said   This is my chance to  e useful   Yatsenko doesn t know where he will  e sent or what role he will  e given  He has yet to tell his mother   ut said he is driven in part  y a desire to make her proud  Convicts have  een used in wartime through much of history  often in the most dangerous roles  Napoleon deployed penal  rigades and  oth Nazi Germany and the Soviet Union drafted criminals and political prisoners  After World War II the practice mostly ended  The Ukraine war has led to a resurgence  Russia s Wagner militia  egan to recruit convicts soon after its Fe ruary      invasion started to go awry  Ukraine s program will differ in several respects  Unlike in Russia  those convicted of certain crimes won t  e eligi le  That includes those with convictions for sexual violence  traffic accidents that led to deaths  and murder if it was of more than one person or carried out with  particular cruelty   among other restrictions  said Olena Vysotska  a deputy Ukrainian justice minister  While Russian prisoners will mainly get their criminal record expunged after service  Ukrainians won t  Ukraine s Ministry of Justice estimates that authorities can recruit a out       people from prisons  Russia never confirmed the total num er of convicts it recruited  ut figures from the prison service show a reduction of more than        in its total prison population  etween May      and January       the peak of Wagner s recruitment  Convicts will  e placed in special units   ut it isn t clear what they will  e tasked to do  Ukraine s Ministry of Defense didn t comment  though the country tends to take fewer risks with its soldiers than Russia does  Volodymyr  arandich  another recruit  said he is impatient to leave jail for a front line position  A out six months ago  arandich was an army corporal serving near the town of Avdiivka  one of the front line s most dangerous spots  when he was sentenced for a drug dealing offense  He maintains his innocence and said he was set up  y a former friend   I felt ashamed   ecause I was in here and my colleagues were still at the front   he said  He has almost five years of his sentence to run  The    year old had  een in the military for six years when he was jailed  During his time in prison he said he never lost the am ition to return to the front line  Then in May  he was in a prison workshop when another convict told him that a law had passed that would allow those in jail to serve   Finally   he said he remem ers thinking  Neither  arandich nor Yatsenko say they are nervous a out fighting  Vysotska  the deputy justice minister  said there are patriots among convicts who want to reha ilitate themselves  A prison service should emphasize correcting  ehavior and resocializing people  not incarceration for the sake of it  she said  Yatsenko says other prisoners told him they will see how he and other convicts fare  efore deciding  On a recent visit to their prison   ored looking men stood in courtyards smoking  Some la ored under a hot sun    ut prison life is like a summer holiday camp  compared with the front  said  arandich      Oksana Pyrozhok and Ievgeniia Sivorka contri uted to this article "

POS TAGGING: Identify Nouns and Proper nouns

# install.packages("spacyr")
library(spacyr)
# spacy_install()
spacy_initialize(model = "en_core_web_sm")
## successfully initialized (spaCy Version: 3.7.5, language model: en_core_web_sm)
# Create a list of the tokens for each document and assign it as a column to the dataframe  

vector_vectors <- list()
for (i in 1:nrow(newspapers)) {
  token_vector <- tokens_news[[i]]
  vector_vectors <- c(vector_vectors, list(token_vector))
}

newspapers$tokens<- vector_vectors

# Join the tokens list into text and replace the old text column 

combine_tokens <- function(token_list) {
  joined = str_c(token_list, collapse=" ")
  return(joined)
}

newspapers$text <- sapply(newspapers$tokens, combine_tokens)

# Obtain the corpus of the new text column 

corpus_new <- corpus(newspapers)
corpus_new
## Corpus consisting of 2,276 documents and 4 docvars.
## text1 :
## "kyiv ukraine vitaliy yatsenko pick parcel contain illegal am..."
## 
## text2 :
## "iryna tsy ukh rescue wound ukraine loodiest attles work com ..."
## 
## text3 :
## "kyiv ukraine vitaliy yatsenko go pick parcel contain illegal..."
## 
## text4 :
## "rivne ukraine russia full scale military invasion ukraine ru..."
## 
## text5 :
## "rivne ukraine russia full scale military invasion ukraine ru..."
## 
## text6 :
## "colleville sur mer france president iden use have day commem..."
## 
## [ reached max_ndoc ... 2,270 more documents ]
head(newspapers[[1,2]])
## [1] "kyiv ukraine vitaliy yatsenko pick parcel contain illegal amphetamine kyiv post office meet policeman week will cut short five year prison sentence join ukraine stretch arm force sign ukrainian military desperate need fresh troop kyiv take page russia play ook recruit inmate prison serve military government say convict apply program prisoner will serve end war efore win freedom kyiv face stark choice initial wave volunteer fade lose grind enemy can draw population time large many front line unit say deplete exhaust ukraine struggle draft enough man hold russian offensive search hundred thousand new soldier ukraine lower age mo ilization increase financial compensation troop seek coerce military age man flee road return fight week yatsenko will leave prison join military man like year old program chance redemption feel ashamed prison say chance useful yatsenko doesn know will send role will give yet tell mother ut say drive part desire make proud convict een use wartime much history often dangerous role napoleon deploy penal rigades oth nazi germany soviet union draft criminal political prisoner world war ii practice mostly end ukraine war lead resurgence russia wagner militia egan recruit convict soon fe ruary invasion start go awry ukraine program will differ several respect unlike russia convict certain crime win eligi le include conviction sexual violence traffic accident lead death murder one person carry particular cruelty among restriction say olena vysotska deputy ukrainian justice minister russian prisoner will mainly get criminal record expunge service ukrainian win ukraine ministry justice estimate authority can recruit people prison russia never confirm total num er convict recruit ut figure prison service show reduction total prison population etween may january peak wagner recruitment convict will place special unit ut isn clear will task ukraine ministry defense didn comment though country tend take few risk soldier russia volodymyr arandich another recruit say impatient leave jail front line position six month ago arandich army corporal serve near town avdiivka one front line dangerous spot sentence drug deal offense maintain innocence say set former friend feel ashamed ecause colleague still front say almost five year sentence run year old een military six year jail time prison say never lose ition return front line may prison workshop another convict tell law pass allow jail serve finally say remem er think neither arandich yatsenko say nervous fight vysotska deputy justice minister say patriot among convict want reha ilitate prison service emphasize correct ehavior resocializing people incarceration sake say yatsenko say prisoner tell will see convict fare efore decide recent visit prison ored look man stand courtyard smoke la ored hot sun ut prison life like summer holiday camp compare front say arandich oksana pyrozhok ievgeniia sivorka contri uted article"

Apply the POS Tagger - Keep only tokens that are Nouns or Proper Nouns

# Apply the tagger on the new corpus 

# Obtain dataframe of pos tag per token in the corpus 
pos_tags <- spacy_parse(corpus_new, 
                        lemma = FALSE,
                        pos = TRUE,
                        entity = FALSE)

# Keep only tokens of noun or propn tags 
pos_tags_nouns <- pos_tags[pos_tags$pos == "NOUN" | pos_tags$pos == "PROPN", ]

print(head(pos_tags_nouns))
##   doc_id sentence_id token_id    token   pos
## 1  text1           1        1     kyiv PROPN
## 2  text1           1        2  ukraine PROPN
## 3  text1           1        3  vitaliy PROPN
## 4  text1           1        4 yatsenko PROPN
## 5  text1           1        5     pick PROPN
## 6  text1           1        6   parcel  NOUN
# Create tokens per document dataframe 
document_per_tokens <- pos_tags_nouns  %>%
  group_by(doc_id) %>%
  summarise(text = str_c(token, collapse=" ")) %>% 
  mutate(digits = str_extract_all(doc_id, "\\d")) %>% 
  mutate(digits = sapply(digits, function(x) paste(x, collapse = ""))) %>% 
  mutate(digits = as.numeric(digits)) %>% 
  arrange(digits)

print(document_per_tokens)
## # A tibble: 2,276 × 3
##    doc_id text                                                            digits
##    <chr>  <chr>                                                            <dbl>
##  1 text1  kyiv ukraine vitaliy yatsenko pick parcel amphetamine kyiv pos…      1
##  2 text2  iryna ukh rescue wound ukraine attles work com medic ms ukh sl…      2
##  3 text3  kyiv ukraine vitaliy yatsenko go pick parcel amphetamine kyiv …      3
##  4 text4  ukraine russia scale invasion ukraine fuel moscow supplier ind…      4
##  5 text5  ukraine russia scale invasion ukraine fuel moscow supplier ind…      5
##  6 text6  colleville sur mer france president iden use day commemoration…      6
##  7 text7  vladimir putin portray defender glo al sta ility nation offer …      7
##  8 text8  wing party look surge election europe week shock wave rift for…      8
##  9 text9  missile launch ukraine year war russia photo adrienne surprena…      9
## 10 text10 half measure ukraine june complain president iden hasn advance…     10
## # ℹ 2,266 more rows

DTM and Frequency Matrix of the NOUN AND PROPN tags

# Create DTM for the noun-propn tokens 
corpus_documents_nouns <- corpus(document_per_tokens)

tokens_nouns = corpus_documents_nouns %>% 
  tokens()

dtm_nouns = tokens_nouns  %>% 
  dfm()

textstat_frequency(dtm_nouns)

We observe that now tokens are more clearly defined and appear cleaner.

Trimming of the new DTM and WordCloud Visualization

# Trim the DTM 
dtm_nouns_tr = dfm_trim(dtm_nouns, min_docfreq = 0.005, 
                        max_docfreq = 0.75,
                        docfreq_type = "prop")

textstat_frequency(dtm_nouns_tr)
textplot_wordcloud(dtm_nouns_tr, max_words=150,min_size=1, max_size = 4,random_order = F,
                   color = rev(RColorBrewer::brewer.pal(4, "Dark2")))

By only using nouns and proper nouns as tokens, the wordcloud is far more cleaner than the previous wordcloud. The most frequent words appear to be semantically important in relation to the topics under discussion.

Add the previously found collocations

# Add collocations to the noun and personal nouns tokens list 

tokens_nouns_col <- tokens_nouns %>%  tokens_compound(collocations)

dtm_nouns_col = tokens_nouns_col  %>% 
  dfm()

dtm_nouns_col_tr = dfm_trim(dtm_nouns_col, min_docfreq = 0.005, 
                        max_docfreq = 0.75,
                        docfreq_type = "prop")

textplot_wordcloud(dtm_nouns_col_tr, max_words=150,min_size=1, max_size = 4,random_order = F,
                   color = rev(RColorBrewer::brewer.pal(5, "Dark2")))

It appears that the most frequent tokens are “israel” followed by “ukraine”, “hamas”, “gaza”, “russia”, “year”, “people” and “attack”. In the following part of our project, we proceed with the Topic Modeling. Model of choice used is LDA.

7. LDA - Unsupervised Topic Modeling

In order to identify the topics hidden in our articles a topic modeling technique called Latent Dirichlet Distribution (LDA) is used [7]. LDA is an unsupervised method (unlabeled data) that creates clusters of words, where is cluster contains words that together form a certain topic. The topic is a latent construct to be labeled by the user. The number of topics, k, to be estimated is decided based on certain methods as shown below.

Number of Topics

Two methods are used to decide on the number of topics. The 1st method, calculates 4 metrics [8-11]. The number of topics is chosen on the points where the first 2 metrics are minimized and the latter 2 metrics are maximized.

1st Method

result <- FindTopicsNumber(
  dtm_nouns_col_tr,
  topics = seq(from = 2, to = 15, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)
## fit models... done.
## calculate metrics:
##   Griffiths2004... done.
##   CaoJuan2009... done.
##   Arun2010... done.
##   Deveaud2014... done.
FindTopicsNumber_plot(result)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ldatuning package.
##   Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

From the plot, all metrics seem to converge at a number of topics number of 15 and above. For our analysis, a number of topics k = 15 is chosen based on this method.

2nd Method

A LDA model is fitted for each value of k, the number of topics. For each fit, a coherence score is calculated as in [12]. A high coherence score, means the assigned words in a topic’s cluster, are more related to each other and thus the cluster is more coherent.

A LDA model is fitted for each of the selected number of topics, k. In our analysis, a model was fitted for each of the 15 number of topics.

normalize <- function(scores) {
  min_score <- min(scores)
  max_score <- max(scores)
  (scores - min_score) / (max_score - min_score)
}

topic_coherence <- function(topic_model, dtm_data, top_n_tokens = 10,
                            smoothing_beta = 1){
  if (!contain_equal_docs(topic_model, dtm_data)) {
    stop("The topic model object and document-term matrix contain an unequal number of documents.")
  }
  
  UseMethod("topic_coherence")
}
#' @export
topic_coherence.TopicModel <- function(topic_model, dtm_data, top_n_tokens = 10,
                                       smoothing_beta = 1){
  # Get top terms for each topic
  top_terms <- terms(topic_model, top_n_tokens)
  
  # Coerce document-term matrix to simple triplet matrix
  dtm_data <- as.simple_triplet_matrix(dtm_data)
  
  # Apply coherence calculation to all topics' top terms
  unname(apply(top_terms, 2, coherence, dtm_data = dtm_data, smoothing_beta = smoothing_beta))
}

#' Helper function for calculating coherence for a single topic's worth of terms
#'
#' @param dtm_data a document-term matrix of token counts coercible to \code{simple_triplet_matrix}
#' @param top_terms a character vector of the top terms for a given topic
#' @param smoothing_beta a numeric indicating the value to use to smooth the document frequencies
#' in order avoid log zero issues, the default is 1
#'
#' @importFrom slam tcrossprod_simple_triplet_matrix
#'
#' @keywords internal
#'
#' @return a numeric indicating coherence for the topic

coherence <- function(dtm_data, top_terms, smoothing_beta){
  # Get the relevant entries of the document-term matrix
  rel_dtm <- dtm_data[,top_terms]
  
  # Turn it into a logical representing co-occurences
  df_dtm <- rel_dtm > 0
  
  # Calculate document frequencies for each term and all of its co-occurences
  cooc_mat <- tcrossprod_simple_triplet_matrix(t(df_dtm))
  
  # Quickly get the number of top terms for the for-loop below
  top_n_tokens <- length(top_terms)
  
  # Using the syntax from the paper, calculate coherence
  c_l <- 0
  for (m in 2:top_n_tokens) {
    for (l in 1:(top_n_tokens - 1)) {
      df_ml <- cooc_mat[m,l]
      df_l <- cooc_mat[l,l]
      c_l <- c_l + log((df_ml + smoothing_beta) / df_l)
    }
  }
  c_l
}

contain_equal_docs <- function(topic_model, dtm_data){
  if (inherits(topic_model, "TopicModel")) {
    topic_model@Dim[1] == nrow(dtm_data)
  }
}
# Fit LDA model with different k and calculate Mean Coherence per Fitted LDA model 

topics_vector <- c()
coherence_scores_vector <- c()

for (k_topic in 2:15) {
  
  # lda = dtm_nouns_tr %>%
  #   convert(to = "topicmodels") %>%
  #   LDA(k=k_topic,control=list(seed=123, alpha = 1/1:k_topic))
  
  lda <- LDA(dtm_nouns_col_tr, k = k_topic, control = list(seed=1234))
  
  coherence_scores <- topic_coherence(lda, dtm_nouns_col_tr)
  
  coherence_score <- mean(normalize(coherence_scores))
  
  coherence_scores_vector <- c(coherence_scores_vector, coherence_score)
  topics_vector <- c(topics_vector, k_topic)
  
  print(paste("Iteration for k =", k_topic))
}
## [1] "Iteration for k = 2"
## [1] "Iteration for k = 3"
## [1] "Iteration for k = 4"
## [1] "Iteration for k = 5"
## [1] "Iteration for k = 6"
## [1] "Iteration for k = 7"
## [1] "Iteration for k = 8"
## [1] "Iteration for k = 9"
## [1] "Iteration for k = 10"
## [1] "Iteration for k = 11"
## [1] "Iteration for k = 12"
## [1] "Iteration for k = 13"
## [1] "Iteration for k = 14"
## [1] "Iteration for k = 15"
coherence_per_topic <- data.frame(topics = topics_vector, coherence_values = coherence_scores_vector )

ggplot(data = coherence_per_topic, aes(x = topics, y = coherence_scores_vector, group = 1)) +
  geom_line(color = "blue", size = 1.5) +
  geom_point(size = 3) +  # Increase the point size
  ggtitle("Mean Coherence Among Topics per Fitted LDA") +
  labs(x = "Number of Topics (k)", y = "Mean Coherence Score") +
  scale_x_continuous(breaks = seq(min(coherence_per_topic$topics), max(coherence_per_topic$topics), by = 1))

According to the mean coherence score, a LDA model with k = 9 number of topics seems to be ideal. Nevertheless, k=15 was chosen, since this number also provides a relatively high coherence score and coincides with the ideal number of topics from the 1st method, as previously shown.

Fit LDA Model with k=15

# Fit LDA with the chosen k from the above methods 

lda <- LDA(dtm_nouns_col_tr, k = 15, control = list(seed=1234))

terms(lda, 10)
##       Topic 1       Topic 2    Topic 3  Topic 4     Topic 5     Topic 6       
##  [1,] "israel"      "israel"   "child"  "israel"    "israel"    "court"       
##  [2,] "gaza"        "gaza"     "family" "iran"      "trump"     "russia"      
##  [3,] "netanyahu"   "hamas"    "year"   "attack"    "iden"      "israel"      
##  [4,] "hamas"       "military" "day"    "hez_ollah" "president" "law"         
##  [5,] "war"         "rafah"    "people" "strike"    "election"  "war"         
##  [6,] "government"  "war"      "video"  "war"       "war"       "country"     
##  [7,] "palestinian" "force"    "home"   "missile"   "democrat"  "genocide"    
##  [8,] "west_ank"    "people"   "time"   "tehran"    "american"  "south_africa"
##  [9,] "security"    "official" "man"    "country"   "voter"     "prosecutor"  
## [10,] "ara"         "city"     "war"    "syria"     "people"    "state"       
##       Topic 7      Topic 8     Topic 9   Topic 10     Topic 11     Topic 12  
##  [1,] "israel"     "ukraine"   "ukraine" "israel"     "hamas"      "ukraine" 
##  [2,] "student"    "russia"    "drone"   "war"        "israel"     "order"   
##  [3,] "mr"         "force"     "russia"  "gaza"       "hostage"    "aid"     
##  [4,] "university" "ukrainian" "missile" "china"      "gaza"       "vote"    
##  [5,] "school"     "soldier"   "weapon"  "official"   "cease_fire" "house"   
##  [6,] "protest"    "troop"     "attack"  "conflict"   "deal"       "senate"  
##  [7,] "campus"     "war"       "use"     "washington" "official"   "democrat"
##  [8,] "hamas"      "line"      "strike"  "president"  "release"    "johnson" 
##  [9,] "jew"        "city"      "system"  "iden"       "talk"       "illion"  
## [10,] "year"       "year"      "defense" "state"      "group"      "licans"  
##       Topic 13     Topic 14       Topic 15  
##  [1,] "company"    "hamas"        "ukraine" 
##  [2,] "year"       "group"        "russia"  
##  [3,] "price"      "israel"       "war"     
##  [4,] "country"    "intelligence" "year"    
##  [5,] "oil"        "official"     "country" 
##  [6,] "war"        "attack"       "support" 
##  [7,] "illion"     "agency"       "europe"  
##  [8,] "market"     "gaza"         "eu"      
##  [9,] "economy"    "accord"       "nato"    
## [10,] "government" "security"     "european"

The above table shows the results of the LDA model. Topics 1,2,4,5,7,10,11 and 14 seem to be related with the conflict in Israel while topics 6,8,9,12 and 15 seem to be related with the Ukraine War. A further dynamic visualization using the LDAvis package [13], allows for better exploration of the topics.

Document Term Matrix

dtm_nouns_col_tr
## Document-feature matrix of: 2,276 documents, 3,002 features (97.24% sparse) and 1 docvar.
##        features
## docs    kyiv ukraine pick post office meet week year prison sentence
##   text1    4       6    1    1      1    1    2    5     12        3
##   text2    0      12    0    0      0    0    1    1      0        0
##   text3    4       8    1    1      1    1    2    6     15        3
##   text4    3      18    0    0      0    0    0    4      0        0
##   text5    1      17    0    0      0    0    0    2      0        0
##   text6    1       2    0    0      0    1    1    6      0        0
## [ reached max_ndoc ... 2,270 more documents, reached max_nfeat ... 2,992 more features ]
# Top 15 Tokens per Topic 
# terms(lda, 10)

# Topic Probabilities per Token
ap_topics <- tidy(lda, matrix = "beta")
ap_topics
# Topic Probabilities per Document 
ap_documents <- tidy(lda, matrix = "gamma")
ap_documents
# Create Topic per Document dataframe for Visualizations 

topics = posterior(lda)$topics %>% 
  as_tibble() %>% 
  rename_all(~paste0("Topic_", .))

meta = docvars(corpus_documents_nouns)
meta$id <- meta$digits
meta$date <- newspapers$Month_Year
meta$title <- newspapers$Title
meta$publisher <- newspapers$Publisher

meta %>%  
  select(date:id) %>%
  add_column(doc_id=docnames(corpus_documents_nouns),.before=1)
tpd = bind_cols(meta, topics) 

tpd <- tpd %>%
  mutate(date = parse_date_time(date, "my"))

tpd$Assigned_Topic <- apply(tpd[, 6:20], 1, function(row) {
  colnames(tpd)[6:20][which.max(row)]
})

Results of LDA Model: Assigned Terms per Topics

# Obtain Top 10 tokens per Topic 
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 6) %>% 
  ungroup() %>%
  arrange(topic, -beta)

# Plot most frequent tokens per Topic 
ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

PyLDAavis Visualization of Topics

Interactive Visualization [13] of how terms are allocated among the topics.

##################### PyLDAavis Visualization of Topics #####################

phi <- posterior(lda)$terms %>% as.matrix
cat(paste0('Dimensions of phi (topic-token-matrix): ',paste(dim(phi),collapse=' x '),'\n'))
## Dimensions of phi (topic-token-matrix): 15 x 3002
cat(paste0('phi examples (8 tokens): ','\n'))
## phi examples (8 tokens):
phi[,1:8] %>% as_tibble() %>% mutate_if(is.numeric, round, 5) %>% print()
## # A tibble: 15 × 8
##       kyiv ukraine    pick    post  office    meet    week    year
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 0       0       0       0.00005 0.00106 0.00017 0.00264 0.00606
##  2 0       0       0       0.00014 0.00023 0       0.00507 0.00107
##  3 0       0       0.00071 0.00205 0.00045 0.00038 0.00155 0.0138 
##  4 0       0       0       0.00091 0       0.00048 0.00285 0.00301
##  5 0       0       0.00006 0.00331 0.00368 0.00077 0.00221 0.00567
##  6 0       0.00594 0       0.00097 0.003   0       0.00316 0.00825
##  7 0       0       0       0.00257 0.00102 0.00087 0.0052  0.00744
##  8 0.00701 0.0448  0.0001  0.00169 0.00102 0.0005  0.00396 0.0116 
##  9 0.00505 0.0430  0.00005 0.00093 0       0.00014 0.00376 0.00852
## 10 0       0       0       0.00003 0.00097 0.00182 0.00641 0.00276
## 11 0       0       0       0.00023 0.0006  0.00122 0.0096  0.00061
## 12 0.00156 0.0258  0.00008 0.00028 0.00061 0.00142 0.00711 0.0076 
## 13 0.00017 0.00433 0.00008 0.00174 0.00252 0.00005 0.0023  0.0201 
## 14 0       0       0.00008 0.00224 0.00261 0.00115 0.00226 0.0103 
## 15 0.00753 0.0636  0.00009 0.00052 0.00011 0.00139 0.00401 0.0153
theta <- posterior(lda)$topics %>% as.matrix
cat(paste0('\n\n','Dimensions of theta (document-topic-matrix): ',
           paste(dim(theta),collapse=' x '),'\n'))
## 
## 
## Dimensions of theta (document-topic-matrix): 2276 x 15
cat(paste0('theta examples (8 documents): ','\n'))
## theta examples (8 documents):
theta[1:8,] %>% as_tibble() %>% mutate_if(is.numeric, round, 5) %>% 
  setNames(paste0('Topic', names(.))) %>% print()
## # A tibble: 8 × 15
##    Topic1  Topic2  Topic3 Topic4  Topic5  Topic6  Topic7  Topic8  Topic9 Topic10
##     <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 0.00017 0.00017 0.0596  1.7e-4 0.00017 0.327   0.00017 0.611   0.00017 0.00017
## 2 0.0001  0.0001  0.238   1  e-4 0.0001  0.0001  0.0576  0.456   0.231   0.0001 
## 3 0.00012 0.00012 0.0863  1.2e-4 0.00012 0.341   0.00012 0.572   0.00012 0.00012
## 4 0.00013 0.00013 0.00013 1.3e-4 0.00013 0.00013 0.00013 0.00013 0.395   0.00013
## 5 0.0002  0.0002  0.0002  2  e-4 0.0002  0.0002  0.0002  0.0002  0.314   0.0002 
## 6 0.00014 0.00014 0.182   1.4e-4 0.00014 0.00014 0.0551  0.166   0.00014 0.188  
## 7 0.00008 0.00008 0.218   8  e-5 0.00008 0.00008 0.00008 0.102   0.0169  0.00008
## 8 0.00008 0.00008 0.00697 8  e-5 0.444   0.0129  0.00008 0.00008 0.00008 0.00008
## # ℹ 5 more variables: Topic11 <dbl>, Topic12 <dbl>, Topic13 <dbl>,
## #   Topic14 <dbl>, Topic15 <dbl>
vocab <- colnames(phi) 

doc_length <- newspapers %>% 
  mutate(
    number_tokens = map_int(tokens, length)
  ) %>% 
  select(Title,number_tokens) 

doc_length = doc_length %>% pull(number_tokens)

textstat_frequency(dtm_nouns_col_tr)
term_frequency <- textstat_frequency(dtm_nouns_col_tr) %>% 
  select(feature,frequency) %>% 
  arrange(match(feature,vocab))

term_frequency = term_frequency  %>% pull(frequency)

json <- createJSON(phi, theta, doc_length, vocab, term_frequency)

serVis(json)
## Loading required namespace: servr

An indicative screenshot of the visualization is provided:

PyLDAvis - Topic Allocation
PyLDAvis - Topic Allocation
PyLDAvis - Topic Allocation
PyLDAvis - Topic Allocation

Assignment of Topic Labels

We can observe that terms such as “trump”, “democratic party”, “michigan”, “election”, “biden” are prevalent in this specific cluster. This cluster is numbered as 12 in the pyLDAvis visualization but corresponds to topic 5 from the LDA model output. This topic’s title could be about articles concerning the U.S. Politics. Similarly, the remaining clusters are assigned topics as such:

Topic 1: Israel-Hamas War Front Topic 2: Israel-Hamas War Front Topic 3: Humanitarian Loss-War Stories Topic 4: Israel-Iran Tensions Topic 5: U.S. Politics-Elections Topic 6: International Court Interventions & World Unrest Topic 7: Student War Protests Topic 8: Ukraine-Russia War Front Topic 9: Ukraine-Russia War Front Topic 10: U.S.A-China Diplomacy on Israel Topic 11: Hostages & Ceasefire Negotiations Topic 12: U.S. Politics-War Aid Topic 13: Impact on World Economy - Sanctions Topic 14: Israel-Hamas Conflict Topic 15: U.S & E.U. Politics-War Aid

# Rename the topics to match the PyLDAavis Visualization 

tpd_labels <- tpd %>% select(date,title,publisher,Assigned_Topic)

tpd_labels <- tpd_labels %>%
  mutate(Assigned_Topic = case_when(
    Assigned_Topic == "Topic_1" ~ "Israel-Hamas War Front",
    Assigned_Topic == "Topic_2" ~ "Israel-Hamas War Front",
    Assigned_Topic == "Topic_3" ~ "Humanitarian Loss-War Stories",
    Assigned_Topic == "Topic_4" ~ "Israel-Iran Tensions ",
    Assigned_Topic == "Topic_5" ~ "U.S. Politics-Elections",
    Assigned_Topic == "Topic_6" ~ "International Court Interventions & World Unrest",
    Assigned_Topic == "Topic_7" ~ "Student War Protests",
    Assigned_Topic == "Topic_8" ~ "Ukraine-Russia War Front",
    Assigned_Topic == "Topic_9" ~ "Ukraine-Russia War Front",
    Assigned_Topic == "Topic_10" ~ "U.S.A-China Diplomacy on Israel",
    Assigned_Topic == "Topic_11" ~ "Hostages & Ceasefire Negotiations",
    Assigned_Topic == "Topic_12" ~ "U.S. Politics-War Aid",
    Assigned_Topic == "Topic_13" ~ "Impact on World Economy - Sanctions",
    Assigned_Topic == "Topic_14" ~ "Israel-Hamas Conflict",
    Assigned_Topic == "Topic_15" ~ "U.S & E.U. Politics-War Aid",
    TRUE ~ as.character(Assigned_Topic)  # Keep the original value for all other cases
  ))

head(tpd_labels)

Frequency of topics across newspapers

########## Topics with the highest percentage of articles 

articles_per_topic <- tpd_labels %>% 
  group_by(Assigned_Topic) %>%
  summarize(
    total_articles = n()) %>% arrange(-total_articles)

articles_per_topic$rel_freq <- articles_per_topic$total_articles / sum(articles_per_topic$total_articles)
# 
# articles_per_topic <- articles_per_topic %>%
#   mutate(Assigned_Topic = factor(Assigned_Topic, levels = c("Topic_1","Topic_2","Topic_3",
#                                                             "Topic_4","Topic_5","Topic_6","Topic_7",
#                                                             "Topic_8","Topic_9","Topic_10","Topic_11",
#                                                             "Topic_12","Topic_13","Topic_14","Topic_15")))

ggplot(data = articles_per_topic, aes(x = Assigned_Topic, y = rel_freq)) +
  geom_bar(stat = 'identity',color = "red", fill = "red") +
  labs(title = "Relative Frequency of Articles per Topic",
       x = "Topic",
       y = "Relative Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the plot we observe that the majority of the articles from both newspapers, belong to the Israel-Hamas and Ukraine-Russia War Fronts. The next most frequent articles are of topics related to Ceasefire and Aid.

Which topics are the most dominant per newspaper

######## Topic Dominance per Newspaper

date_topic_size <- tpd_labels %>%
  group_by(publisher, date, Assigned_Topic) %>%
  summarize(count = n(),
            .groups = 'drop')

date_topic_size <- date_topic_size %>%
  filter(date != as.Date('2023-01-01'))

date_topic_size <- date_topic_size %>%
  filter(date != as.Date('2024-06-01'))

date_topic_size <- date_topic_size %>%
  filter(publisher != "International New York Times")

date_topic_size <- date_topic_size %>% 
  group_by(publisher) %>% 
  mutate(total_articles = sum(count)) %>% 
  group_by(publisher,Assigned_Topic) %>% 
  mutate(total_articles_per_topic = sum(count))

date_topic_size$relative_freq = date_topic_size$total_articles_per_topic / date_topic_size$total_articles

# Define a custom palette with 13 colors
custom_palette <- c(
  brewer.pal(12, "Paired"),  # Use the 12 colors from the "Paired" palette
  "#999999"  # Add one more custom color (you can choose any hex color code)
)

# Your plotting code with the custom palette
ggplot(date_topic_size, aes(x = relative_freq, y = publisher, fill = Assigned_Topic)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
  labs(title = "Topic Dominance Per Newspaper",
       x = "Relative Frequency of Number of Articles",
       y = "Newspapers",
       fill = "Topic Category") +
  theme_minimal() +
  scale_fill_manual(values = custom_palette) +
  theme(
    legend.key.size = unit(0.6, 'cm'),  # Adjusts the size of the legend keys
    legend.text = element_text(size = 9),  # Adjusts the size of the legend text
    legend.title = element_text(size = 11)  # Adjusts the size of the legend title
  )

No newspapers seems to give more gravity to Aid related topics than the other. The differences are mainly in topics related to the world economy & sanctions, were as expected, the Wall Street Journal seems to publish significantly more articles related to that topic. Articles on the Israel-Hamas and Russian-Ukraine conflicts dominate in both newspapers.

Topics Fluctuation through time from start of Israel-Hamas conflict until May 2024

The Alluvial plot [14], is a great way to observe how topics fluctuate across time, since the start of the Israel-Hamas conflict.

#### Topics Through Time Nov23 to May24

articles_distribution_per_date <- date_topic_size %>% 
  select(publisher,date,Assigned_Topic,count) %>% 
  group_by(date,Assigned_Topic) %>% 
  summarise(
    articles_per_date_per_topic = sum(count),
    .groups = 'drop'
    ) 

total_articles_per_date <- articles_distribution_per_date %>% 
  group_by(date) %>% 
  summarize(
    articles_per_date = sum(articles_per_date_per_topic)
  )

articles_distribution_per_date <- merge(
  articles_distribution_per_date, total_articles_per_date, by = "date", all = FALSE) %>% 
  mutate(
    topic_weight_per_date = articles_per_date_per_topic / articles_per_date
  )

library(alluvial)

unique(articles_distribution_per_date$date)
## [1] "2023-11-01 UTC" "2023-12-01 UTC" "2024-01-01 UTC" "2024-02-01 UTC"
## [5] "2024-03-01 UTC" "2024-04-01 UTC" "2024-05-01 UTC"
articles_distribution_per_date$date <- factor(articles_distribution_per_date$date, 
                                              levels = unique(articles_distribution_per_date$date))

articles_distribution_per_date <- articles_distribution_per_date %>% 
  select(Assigned_Topic,date,topic_weight_per_date)

cols <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b",
          "#e377c2", "#7f7f7f", "#bcbd22", "#17becf", "#aec7e8", "#ffbb78","gray7")

alluvial_ts(articles_distribution_per_date, wave = .3, ygap = 5, col = cols, plotdir = 'centred', alpha=.9,
            grid = TRUE, grid.lwd = 5, xmargin = 0.2, lab.cex = .7, xlab = '',
            ylab = '', border = NA, axis.cex = .8, leg.cex = .7,
            leg.col='white', 
            title = "Topic Trends Across Time\nFrom November 2023 to May 2024\n") 

The above plot can be used to validate if indeed the topics are allocated correctly, by looking if the spikes in articles’ frequency of a specific topic at a specific time, indeed match an observed event in real time, that caused that spike at the specific time period. This form of validation is called predictive validity [17]. From the time series plot above, we can observe that articles on Israel-Iran tensions show a high frequency starting from April 2024, which indeed coincides with the aerial attacks of Iran on Israel.

8. Sentiment Analysis on Articles with Headlines About Funding and Support

The goal is to predict the sentiment of articles related to Aid and Support provided in both the Russian-Ukraine and Israel-Hamas conflict. The aim is to compare newspapers in terms of their sentiment towards Aid provided in either conflict. The headlines of the articles were used to predict the sentiment of each article. A dictionary method was chosen for the sentiment prediction. In the dictionary method, a pre-existing dictionary is used, where the keys correspond to words and the values to the emotion related to that word. Sentiment is assigned to an article by counting the frequency of positive or negative tokens appearing in a give headline. The NRC sentiment dictionary [16] was chosen as the pre-defined lexicon in our analysis.

Identify Topics Related to Funding and Aid

Topics with Higher Frequency of the word “Aid”
Topics with Higher Frequency of the word “Aid”
Topics with Higher Frequency of the word “Fund”
Topics with Higher Frequency of the word “Fund”

From the PyLDAvis visualization, we observe that the topics 1,2,3,4,10 and 12 have the highest appearance of the word “Aid”. These topics correspond to topics 2,5,10,11,12 and 15 from the LDA model output. The word “fund” additionaly appears in topics 13 and 15, which correspond to topics 13 and 14 of the topic model.

In the next step, only articles belonging to topics 2,5,10,11,12,13,14,15. More precisely, these numbered topics correspond to the following labels: “Israel-Hamas War Front”, “U.S. Politics-Elections”, “U.S.A-China Diplomacy on Israel”,“Hostages & Ceasefire Negotiations”,“U.S. Politics-War Aid”,“Impact on World Economy - Sanctions”,“Israel-Hamas Conflict”,“U.S & E.U. Politics-War Aid”.

Filter the articles based on these Topics

# Filter only the desired topics regarding funding and aid 
tpd_labels_idf <- tpd_labels %>% filter(Assigned_Topic %in% c("Israel-Hamas War Front",
                                                              "U.S. Politics-Elections",
                                                              "U.S.A-China Diplomacy on Israel",
                                                              "Hostages & Ceasefire Negotiations",
                                                              "U.S. Politics-War Aid",
                                                              "Impact on World Economy - Sanctions",
                                                              "Israel-Hamas Conflict",
                                                              "U.S & E.U. Politics-War Aid"))

head(tpd_labels_idf)

Identify the most important words in the articles’ headlines using a TF-IDF matrix

The TF-IDF matrix is calculated as in [7]. The TF-IDF matrix helps identify the most important words. Importance is defined as a combination of frequency across documents (articles) and frequency within each document (article). Basically, a penalty is assigned, when a token appears in many documents. For example, a stop word would appear a lot in a document and thus could be important but it would also appear in almost all the documents, which negatively influences it’s importance.

colnames(tpd_labels_idf) <- c('date','text','publisher','assigned_topic')

tf_idf = corpus(tpd_labels_idf) %>% 
  tokens() %>% 
  dfm() %>% 
  dfm_tfidf(scheme_tf="prop", smoothing=1)

tf_idf_dataframe <- as.data.frame(as.matrix(tf_idf))

# Find the highest TF-IDF scores for each document
important_words <- tf_idf_dataframe %>%
  summarise(across(everything(), max, na.rm = TRUE)) %>%
  pivot_longer(cols = everything(), names_to = "term", values_to = "tf_idf") %>%
  arrange(desc(tf_idf))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(everything(), max, na.rm = TRUE)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
# Print the top 10 most important words
head(important_words, 10)

The TF-IDF matrix did not help identify important words regarding funding and aid. A more direct approach will be followed. Articles that mention words such as “aid” and “fund” in either the headlines or text will be used for the sentiment analysis. Articles that mention words such as “aid”,“fund”,“assistance”,“support”, “package” and “bill” in their text and articles that additionally have the dollar sign, $, in the headlines, will be investigated in terms of their sentiment (positive or negative).

Keep articles that mention words related to the terms funding and aid

Clean the headlines of the articles

newspapers_titles <- newspapers %>% select(Month_Year,Publisher,Title)

colnames(newspapers_titles) <- c("Month_Year","Publisher","text")

corpus_titles <- corpus(newspapers_titles)

tokens_titles = corpus_titles %>%
  tokens() %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_replace(pattern = lexicon::hash_lemmas$token, replacement = lexicon::hash_lemmas$lemma) %>%
  tokens_select(min_nchar = 2)

# Create a list of the tokens for each document and assign it as a column to the dataframe  

vector_vectors <- list()
for (i in 1:nrow(newspapers_titles)) {
  token_vector <- tokens_titles[[i]]
  vector_vectors <- c(vector_vectors, list(token_vector))
  
}

newspapers_titles$tokens<- vector_vectors

# Join the tokens list into text and replace the old text column 

combine_tokens <- function(token_list) {
  joined = str_c(token_list, collapse=" ")
  
  return(joined)
}

newspapers_titles$text <- sapply(newspapers_titles$tokens, combine_tokens)

head(newspapers_titles[,3:4])
titles <- newspapers_titles$text
newspapers$titles <- titles

Use of general expressions to identify funding related articles

Articles are investigated both by their headlines and main text. Articles with Funding and Aid related words in their text or headlines are chosen.

funding_terms_text <- c("aid","fund","support","bill","assistance","package",
                         "billion")

pattern <- paste0("\\b(", paste(funding_terms_text, collapse = "|"), ")\\b")

identify_funding <- function(text) { 
  
  if (grepl(pattern, text)) {
    
    return(TRUE)
  } else {
    return(FALSE)
  }
}

newspapers_aid <- newspapers %>%
  mutate(
    funding_topic_title = unlist(map(titles, identify_funding)),
    funding_topic_text = unlist(map(text,identify_funding))# unlist creates a vector 
  )

head(newspapers_aid)
sum(newspapers_aid$funding_topic_title)
## [1] 368
sum(newspapers_aid$funding_topic_text)
## [1] 1021

Filter by looking at both main text and headlines

newspapers_aid <- newspapers_aid %>% filter(
  funding_topic_text == TRUE | funding_topic_title == TRUE
)

newspapers_aid <- newspapers_aid %>% select(Month_Year, Title, titles, Publisher)

colnames(newspapers_aid) <- c("month_year","title","title_text","publisher")
newspapers_aid <- merge(newspapers_aid, tpd_labels ,by="title")

newspapers_aid <- newspapers_aid %>% select(date, month_year, title, title_text, publisher.x, Assigned_Topic)

colnames(newspapers_aid) <- c("date", "month", "title", "title_text", "publisher", "topic")
newspapers_aid <- newspapers_aid %>% mutate(
  conflict = case_when(topic=="Hostages & Ceasefire Negotiations" ~ "Israel Conflict",
                       topic=="Israel-Hamas Conflict" ~ "Israel Conflict",
                       topic=="Israel-Hamas War Front" ~ "Israel Conflict",
                       topic=="Israel-Iran Tensions "  ~ "Israel Conflict",
                       topic=="Ukraine-Russia War Front" ~ "Ukraine Conflict",
                       TRUE ~ topic)
)

ukraine_terms <- c("ukraine","russia","ukrainian","zelensky","putin","ukrai")
israel_terms <- c("israel","israeli","jewish","netanyahu","gaza")

pattern_1 <- paste0("\\b(", paste(ukraine_terms, collapse = "|"), ")\\b")
pattern_2 <- paste0("\\b(", paste(israel_terms, collapse = "|"), ")\\b")

identify_funding <- function(text) { 
  
  if (grepl(pattern_1, text)) {
    
    return("Ukraine Conflict")
    
  } else if (grepl(pattern_2, text)) {
    
    return("Israel Conflict")
    } else {
         return(text)
    }
  }

newspapers_aid <- newspapers_aid %>%
  mutate(
    conflict_2 = unlist(map(title_text, identify_funding)),
)

newspapers_aid <- newspapers_aid %>% filter(
  conflict_2 %in% c("Ukraine Conflict","Israel Conflict") | conflict %in% c("Ukraine Conflict", "Israel Conflict")
)
newspapers_aid <- newspapers_aid %>% filter(
  conflict_2 %in% c("Ukraine Conflict","Israel Conflict")) %>% select(date, month, title, title_text, publisher, conflict_2)

colnames(newspapers_aid) <- c("date","month","title","cleaned_text","publisher","topic")

head(newspapers_aid)
# write.csv(newspapers_aid, "newspapers_titles.csv", row.names = FALSE)

Loading the dataset for Sentiment Analysis

# Load the data
data <- read.csv("/Users/alessandrosalvatori/Desktop/KU LEUVEN/EXAMS/SECOND YEAR/RETAKES/COLLECTING AND ANALYZING BIG DATA FOR SOCIAL SCIENCES/PROJECT/Sentiment Analysis/newspapers_titles.csv", stringsAsFactors = FALSE)

Additional Text Preprocessing

# Preprocessing the text data
# Convert text to lowercase
data$cleaned_text <- tolower(data$cleaned_text)

# Remove punctuation, numbers, and stopwords
data$cleaned_text <- removePunctuation(data$cleaned_text)
data$cleaned_text <- removeNumbers(data$cleaned_text)
data$cleaned_text <- removeWords(data$cleaned_text, stopwords("en"))

Sentiment Analysis using Dictionary Approach

# Apply NRC sentiment analysis
nrc_sentiments <- get_nrc_sentiment(data$cleaned_text)

# Add the sentiment scores to the original data
data <- cbind(data, nrc_sentiments)

# Aggregate sentiment data to get overall positive and negative sentiment for each article
data$sentiment <- ifelse(data$positive > data$negative, "positive", "negative")
# Summarize the data by publisher and topic
sentiment_summary <- data %>%
  group_by(publisher, topic, sentiment) %>%
  summarise(count = n(), .groups = "drop")

Looking at the results about the Israel conflict, both newspapers show a significant skew towards negative sentiment. However, the Wall Street Journal has a much stronger negative bias compared to The New York Times. The ratio of negative to positive coverage in Wall Street Journal is slightly more pronounced than in The New York Times, indicating that Wall Street Journal might be more critical about this conflict. In contrast to the Israel conflict, the sentiment is more balanced for the Ukraine conflict, especially in The New York Times, where positive articles (88) slightly outweigh negative ones (78). This suggests a more optimistic portrayal of the Ukraine conflict in The New York Times. Wall Street Journal still shows a bias towards negative sentiment, but the difference is less stark than in the other conflict.

The differences in how the two newspapers cover both conflicts can offer valuable insights into their editorial policies and the audiences they try to reach. For example, The New York Times might aim for a more balanced or optimistic view in some cases, while the second one might focus more on negative aspects.

Visualizations of the results

The following code is used to visualize the results obtained from the sentiment analysis. The R package ggplot was used.

# Bar plot for comparing sentiment across topics
ggplot(sentiment_summary, aes(x = topic, y = count, fill = sentiment)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ publisher) +
  labs(title = "Sentiment Comparison by Topic and Publisher",
       x = "Topic",
       y = "Sentiment Count") +
  scale_fill_manual(values = c("positive" = "lightgreen", "negative" = "#FF7074"))

This graph is a barplot showing the distribution of positive and negatives articles of the two newspapers on the two conflicts. They give a better understanding of the conclusions already made previously.

These results highlight significant differences in how these two major newspapers cover the Israel and Ukraine conflicts. The New York Times tends towards a more balanced portrayal, especially in the Ukraine conflict, while Wall Street Journal shows a stronger negative bias in both conflicts. These differences can influence public perception and provide insights into the editorial strategies of these publications.

9. Limitations

10. Conclusion

From the Topic Modeling analysis no significant differences regarding aid and funding topics were observed between the two newspapers. From the time series of the topics since the start of the war (alluvial plot) and from the frequency of articles plot (bar plot), it can be observed that the Israel-Hamas Conflict remains the most dominant topic across newspapers, in terms of the articles’ frequency. The sentiment analysis appears to be more informative on answering our research questions. More meaningful results were obtained from the sentiment analysis, were it seems that the Wall Street Journal, with more conservative ideals, tends to have a more negative feeling towards the financial support and aid in these two conflicts. The results from the sentiment analysis are in line with what we would expect, being the Wall Street Journal more conservative and the New York Times more liberal, and they seem to confirm that there might be some bias in newspapers when addressing these topics.

References

[1] “Ukraine’s counteroffensive against Russia in maps: latest updates.” Accessed: Jul. 07, 2024. [Online]. Available: https://www.ft.com/content/4351d5b0-0888-4b47-9368-6bc4dfbccbf5

[2] “How Much U.S. Aid Is Going to Ukraine? | Council on Foreign Relations.” Accessed: Jul. 07, 2024. [Online]. Available: https://www.cfr.org/article/how-much-us-aid-going-ukraine

[3] “Why are some Republicans opposing more aid for Ukraine?,” Dec. 07, 2023. Accessed: Jul. 07, 2024. [Online]. Available: https://www.bbc.com/news/world-us-canada-67649497

[4] R. W. Austin Moira Fagan, Sneha Gubbala and Sarah, “1. Views of Ukraine and U.S. involvement with the Russia-Ukraine war,” Pew Research Center. Accessed: Jul. 07, 2024. [Online]. Available: https://www.pewresearch.org/global/2024/05/08/views-of-ukraine-and-u-s-involvement-with-the-russia-ukraine-war/

[5] “APIs | Dev Portal.” Accessed: Aug. 09, 2024. [Online]. Available: https://developer.nytimes.com/apis

[6] D. Altschiller, “Research: WR150: Educated Electorate: Newspapers - which way do they lean?” Accessed: Jul. 07, 2024. [Online]. Available: https://library.bu.edu/blumenthal/bias

[7] W. van A. Arcila Damian Trilling &. Carlos, “Computational Analysis of Communication.” Accessed: May 16, 2024. [Online]. Available: https://cssbook.net/

[8] R. Arun, V. Suresh, C. E. Veni Madhavan, and M. N. Narasimha Murthy, “On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations,” in Advances in Knowledge Discovery and Data Mining, vol. 6118, M. J. Zaki, J. X. Yu, B. Ravindran, and V. Pudi, Eds., in Lecture Notes in Computer Science, vol. 6118. , Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 391–402. doi: 10.1007/978-3-642-13657-3_43.

[9] J. Cao, T. Xia, J. Li, Y. Zhang, and S. Tang, “A density-based method for adaptive LDA model selection,” Neurocomputing, vol. 72, no. 7–9, pp. 1775–1781, Mar. 2009, doi: 10.1016/j.neucom.2008.06.011.

[10] R. Deveaud, E. SanJuan, and P. Bellot, “Accurate and effective latent concept modeling for ad hoc information retrieval,” Document numérique, vol. 17, no. 1, pp. 61–84, Apr. 2014, doi: 10.3166/dn.17.1.61-84.

[11] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proc. Natl. Acad. Sci. U.S.A., vol. 101, no. suppl_1, pp. 5228–5235, Apr. 2004, doi: 10.1073/pnas.0307752101.

[12] F. tang, “Beginner’s Guide to LDA Topic Modelling with R,” Medium. Accessed: Aug. 09, 2024. [Online]. Available: https://towardsdatascience.com/beginners-guide-to-lda-topic-modelling-with-r-e57a5a8e7a25

[13] C. Sievert, cpsievert/LDAvis. (Jul. 10, 2024). JavaScript. Accessed: Aug. 09, 2024. [Online]. Available: https://github.com/cpsievert/LDAvis

[14] M. Bojanowski, mbojan/alluvial. (Jul. 16, 2024). R. Accessed: Aug. 09, 2024. [Online]. Available: https://github.com/mbojan/alluvial

[15] P. Ghasiya and K. Okamura, “Understanding the Middle East through the eyes of Japan’s Newspapers: A topic modelling and sentiment analysis approach,” Digital Scholarship in the Humanities, vol. 36, no. 4, pp. 871–885, Dec. 2021, doi: 10.1093/llc/fqab019.

[16] Canada, G. of C. N. R. C. (2024, August 31). NRC emotion lexicon—NRC Publications Archive. https://nrc-publications.canada.ca/eng/view/object/?id=0b6a5b58-a656-49d3-ab3e-252050a7a88c

[17] Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028

[18] Wankhade, M., Rao, A. C. S., & Kulkarni, C. (2022). A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review, 55(7), 5731–5780. https://doi.org/10.1007/s10462-022-10144-1